[SPARK-16371][SQL] Do not push down filters incorrectly when inner name and outer name are the same in Parquet #14067
HyukjinKwon wants to merge 4 commits into apache:master from HyukjinKwon/SPARK-16371
Conversation
Hi, @rxin @liancheng, I hope this does not miss 2.0.

cc @viirya as well.

LGTM pending Jenkins. 2.0.0 RC2 has already been cut. We may have this in 2.0.0 if there is another RC.

Oh, @liancheng, I just corrected some more. Please take another look (sorry).
!f.metadata.contains(StructType.metadataKeyForOptionalField) ||
!f.metadata.getBoolean(StructType.metadataKeyForOptionalField)
}.map(f => f.name -> f.dataType) ++ fields.flatMap { f => getFieldMap(f.dataType) }
}.map(f => f.name -> f.dataType)
Could you please add some comment here?
The description seems incorrect. It should be a StringType instead of IntegerType?

Yes, it is not; I just found that. I will correct them all. Thank you!

Another question: when the inner field does not have the same name, we can still push down the filter, right?

If so, then this patch seems to completely skip all such push-downs.
Test build #61841 has finished for PR 14067 at commit

Test build #61840 has finished for PR 14067 at commit

Test build #61842 has finished for PR 14067 at commit

Test build #61843 has finished for PR 14067 at commit
Yeah, currently Spark SQL doesn't support column pruning and/or filter push-down for nested fields.

OK. LGTM then.
}
}

test("Do not push down filters incorrectly when inner name and outer name are the same") {
I'm going to merge this and fix some comments myself with another PR. Merging into master/2.0.
[SPARK-16371][SQL] Do not push down filters incorrectly when inner name and outer name are the same in Parquet
## What changes were proposed in this pull request?
Currently, if there is a schema as below:
```
root
|-- _1: struct (nullable = true)
| |-- _1: integer (nullable = true)
```
and if we execute the codes below:
```scala
df.filter("_1 IS NOT NULL").count()
```
This pushes down a filter even though the filter is being applied to a `StructType`. (If my understanding is correct, Spark does not push down filters for those.)
The reason is that `ParquetFilters.getFieldMap` produces the results below:
```
(_1,StructType(StructField(_1,IntegerType,true)))
(_1,IntegerType)
```
and then it becomes a `Map`:
```
(_1,IntegerType)
```
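The collapse above is just ordinary `Map` semantics: when a sequence of key/value pairs with duplicate keys is turned into a `Map`, the last entry for a key wins. A standalone sketch (plain Scala, not Spark code; the string values merely stand in for Catalyst data types):

```scala
object MapCollisionSketch extends App {
  // Pairs in the order getFieldMap produced them: the outer field first,
  // then the nested field that happens to share the name "_1".
  val pairs = Seq(
    "_1" -> "StructType(StructField(_1,IntegerType,true))", // outer field
    "_1" -> "IntegerType"                                   // nested field
  )

  // toMap keeps the last value for a duplicate key, so the outer
  // StructType entry is silently overwritten by the nested IntegerType.
  val fieldMap = pairs.toMap
  println(fieldMap("_1")) // prints IntegerType
}
```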
Now, because of ` ....lift(dataTypeOf(name)).map(_(name, value))`, this pushes down filters for `_1`, which Parquet thinks is `IntegerType`. However, it is actually a `StructType`.
So, Parquet's filter2 API produces incorrect results; for example, the code below:
```scala
df.filter("_1 IS NOT NULL").count()
```
always produces 0.
This PR prevents this by no longer collecting nested fields in `getFieldMap`.
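A minimal sketch of the shape of the change (plain Scala with tiny stand-in types, not the actual Catalyst classes; the real change lives in `ParquetFilters.getFieldMap` and also involves the optional-field metadata filter shown in the diff): before the fix the helper recursed into nested structs, after the fix it maps only top-level fields.

```scala
object FieldMapSketch extends App {
  // Illustrative stand-ins for Catalyst's DataType hierarchy.
  sealed trait DataType
  case object IntegerType extends DataType
  final case class StructField(name: String, dataType: DataType)
  final case class StructType(fields: Seq[StructField]) extends DataType

  // Before the fix: nested fields are collected too, so an inner "_1"
  // shadows the outer "_1" once the pairs are turned into a Map.
  def getFieldMapBefore(dt: DataType): Seq[(String, DataType)] = dt match {
    case StructType(fields) =>
      fields.map(f => f.name -> f.dataType) ++
        fields.flatMap(f => getFieldMapBefore(f.dataType))
    case _ => Seq.empty
  }

  // After the fix: only top-level fields are mapped.
  def getFieldMapAfter(dt: DataType): Seq[(String, DataType)] = dt match {
    case StructType(fields) => fields.map(f => f.name -> f.dataType)
    case _ => Seq.empty
  }

  // root |-- _1: struct |-- _1: integer
  val schema = StructType(Seq(
    StructField("_1", StructType(Seq(StructField("_1", IntegerType))))))

  println(getFieldMapBefore(schema).toMap) // "_1" resolves to IntegerType (wrong)
  println(getFieldMapAfter(schema).toMap)  // "_1" resolves to the StructType (correct)
}
```

With the recursion removed, a filter on the outer `_1` sees a `StructType` in the field map and is therefore not handed to Parquet's filter2 push-down at all.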
## How was this patch tested?
Unit test in `ParquetFilterSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #14067 from HyukjinKwon/SPARK-16371.
(cherry picked from commit 4f8ceed)
Signed-off-by: Reynold Xin <rxin@databricks.com>
